build(medcat and medcat-den): CU-869ddh1jv Avoid test resources in releases#503
Conversation
There was a problem hiding this comment.
Makes sense, looks a lot better than before. Will be keen to get medcat-service in on this as well, with the complication that it also wants the models in the docker image.
Pooch looks great as well. I've got a few places (trainer, medcat service k8s) where I need to download assets on startup and might look to use this instead.
| v2_model = "mct2_model_pack.zip" | ||
|
|
||
|
|
||
| def _get_version(project_name: str = 'medcat') -> str: |
There was a problem hiding this comment.
As a dumber option as it's all hardcoded anyway - could we just pin an exact stable version instead of using _get_version?
I'd probably rather always pull cogstack-nlp/releases/download/3.11/my_zip reliably, instead of being a dynamic URL to debug later
Right now it kind of looks like the medcat version will change the test model used, but I'd rather make it a deliberate change. I really dont see the test models changing any time soon
There was a problem hiding this comment.
I see what you're saying. And I agree that these files are unlikely to change frequently.
But I do think it's realistic that they may be added to at some point.
Hard-coding the version here would then mean we'd need to bump the version every time we do a release.
Realistically, we're never really running the tests in the state that this solution is designed for (i.e an install from source distribution that runs the tests).
| with: | ||
| packages_dir: medcat-den/dist | ||
|
|
||
| # test-time models for download |
There was a problem hiding this comment.
Along with the pin version comment -
Potentially we could do a release/tag for "medcat-test-models" separately, and test models from the medcat/medcat-den versions/releases? (Basically what you did with the release bundle idea).
With the assumption that the test models basically never change. Main advantage is it would probably be easier to reason about "medcat 3 is failing test X, but I know the test model zip is still version 1 and that exact file hasn't changed for a year", but I do see the advantage of versioning it all together as well, I'm sure you've already thought about this one! Definitely hard if the zip in source code changes but the released one doesnt...
There was a problem hiding this comment.
If we end up using this for more different parts it may make sense to centralise the entire logic into a installable package.
I think the main thing I want to avoid is yet another release cadence. As much as it's unlikely we'll need to bump this often, having these not tied to the releases of the specific tools that use them would (at least in my eyes) make it harder to track what version of which tool should work with what version of these test time models.
There was a problem hiding this comment.
Cool - yeah that's fair for sure that it gets harder to track. I'm all good with this as you have it!
* fix(medcat): CU-869ddh1jv: Fix lock file issue. Issue was introduced in #503. Looks like because I didn't update uv.lock the entire dependency resolution was redone and broke. This PR should - hopefully - fix it. * CU-869ddh1jv: Disallow latest typer version * CU-869ddh1jv: Disallow all recent typer versions --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
The underlying issue
medcat-densource distribution are pushed to TestPyPI on every commit. And because they include test-time resources (test / fake models) they are rather large (~32MB). Over time this has meant we've reached PyPI's per project storage limit of 10GB. So now, because of this,medcat-denworkflows on themainbranch are failing because TestPyPI uploads are failing.Caveats to consider
The idea of packaging your tests (along with the resources required to run them) is quite common for source distributions. In fact, the default behaviour seems to be to include everything that is tracked by
git. There are a number of ways to get around this (i.e removing the files before building, pruning inMANIFEST.in), but they seem to be counter to the open source principles or not really following modern package building standards.The proposed plan
In order to make this a viable option, I plan to store test time models centrally to the repo. This means that they won't be included in the builds since they're outside the scope of the source. But it also has the added benefit of allowing us to reused the same test models across multiple projects within the repo (e.g
medcatandmedcat-den, but why notmedcat-serviceas well). On top of that there needs to be a way to access these files from a source distribution. And because that now doesn't include these test-time resources, they need to be fetached. The plan usespoochto do the fetching from the relevant version on GitHub, but the logic defaults to local files if available. This will involve including these files in relevant releases as well. On the way there we also need to make some changes on the exact paths that are used to interact with these models in the test suite (but that shouldn't be extensive).This is the plan: